{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Assignment_2\n",
    "\n",
    "## 问题\n",
    "- Data: W3C HTML5 中文兴趣小组一个月的邮件存档, 格式mbox\n",
    "- 试定位邮件中的签名档, 并尽可能提取多的字段\n",
    "- 编码的问题, 邮件的编码不统一.\n",
    "\n",
    "\n",
    "## 思考\n",
    "- 是不是应该考虑先来一波暴力提取\n",
    "    - 确实思路是层层递进, 先粗略来一波, 慢慢精确到想得到的信息.\n",
    "- 很多乱码, 涉不涉及转码?\n",
    "- 每封邮件的起止位置标识是什么?\n",
    "    - 答: From ....@......\n",
    "- 如何尽可能多的提取信息?\n",
    "    - 如何让正则通用性更高\n",
    "- 新思路: 用分词提取所有姓名, 然后根据每一条邮件内容提取签名档\n",
    "- 因为提取的姓名格式有几个形式不同, 比如 中英混杂, 名字中间带括号.\n",
    "    - 所以考虑先生成一份处理过的name_list, 然后, 编译这份name_list到正则表达式.\n",
    "    - 但是这么做应该会导致程序比较慢, 尤其是当name_list长度特别大时. \n",
    "    - 可以通过观察name_list, 发现里面很多条目是重复的, 因为邮件互相往来嘛... 本着'最小'的理念, 先试试!\n",
    "- 另外一个想法:\n",
    "    - 对全文进行分词, 提取人名, 加之前通过Flanker提取的人名得到一份名单.\n",
    "- 签名档没有固定格式, 感觉找不到一个统一的规则去提取签名档...心塞!\n",
    "    - 难道要写一个巨能容错的正则(规则复杂, 冗长)去匹配, 然后在筛选?\n",
    "\n",
    "\n",
    "## 过程\n",
    "- 根据mbox格式分开每一封邮件\n",
    "- 利用flanker print 邮件内容, 观察是否有一般的提取模式\n",
    "    - 期间遇到了编码问题, 有NoneType, 以及str, 还有Unicode.\n",
    "- 发现无法利用内容提取, 想到利用发件人姓名\n",
    "- 如何提取发件人姓名?\n",
    "    - 利用flanker在header里面找到'From', 可以提取发件人姓名, 邮箱.\n",
    "\n",
    "\n",
    "## References\n",
    "- [mbox-wikipedia](https://en.wikipedia.org/wiki/Mbox)\n",
    "- "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import os\n",
    "import re\n",
    "from nltk.tag import StanfordNERTagger\n",
    "import flanker\n",
    "\n",
    "from bs4 import BeautifulSoup\n",
    "from flanker import mime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Add models of NLTK  \n",
    "os.environ[\"CLASSPATH\"] = \"/Users/xpgeng/Library/stanford-ner-2015-12-09\"  \n",
    "os.environ[\"STANFORD_MODELS\"] = \"/Users/xpgeng/Library/stanford-ner-2015-12-09/models\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Add Tagger\n",
    "st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def prepare_data(filename='2013-11.mbx'):\n",
    "    with open(filename, 'r') as f:\n",
    "        data = f.read()\n",
    "    f.close()\n",
    "    data_list = filter(None, re.split(r'From\\s([\\w+.?]+@(\\w+\\.)+(\\w+))', data))  #  \n",
    "    # Here I have to add twice for-loop, I haven't analyse the reason\n",
    "    for data in data_list:\n",
    "        if len(str(data)) < 500:\n",
    "            data_list.remove(data)\n",
    "    for data in data_list:\n",
    "        if len(str(data)) < 500:\n",
    "            data_list.remove(data)\n",
    "    return data_list "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data_list = prepare_data('2013-11.mbx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def create_name_list(data_list):\n",
    "    name_list = []\n",
    "    p = re.compile(ur'\\\"?([\\w\\s\\(\\)]+|[\\x80-\\xff]+)\\\"?\\s<')\n",
    "    for message_string in data_list:\n",
    "        msg = mime.from_string(message_string)\n",
    "        for item in msg.headers.items():\n",
    "            if item[0] == 'From':\n",
    "                name = p.search(item[1].encode('utf-8')).group(1)\n",
    "                name_list.append(name)\n",
    "    name_list = list(set(name_list))\n",
    "    name_list += ['Cindy', 'Kenny', 'Chen Yijun', 'Chunming', '-ambrose']\n",
    "    name_list.remove('com')\n",
    "    name_list.remove(' Chunming')\n",
    "    name_list.remove(' Bobby Tung')\n",
    "    name_list.remove('Hawkeyes Wind')\n",
    "    return name_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def extract_signature(message_string, name_list):\n",
    "    signature_list = []\n",
    "    for name in name_list:\n",
    "        p_name = re.compile(r'^%s.+'%name, re.MULTILINE | re.DOTALL)\n",
    "        msg = mime.from_string(message_string)\n",
    "        for part in msg.parts:\n",
    "            if not isinstance(part.body, (type(None), str)):\n",
    "                if p_name.findall(part.body.encode('utf-8')):\n",
    "                    signature_list += p_name.findall(part.body.encode('utf-8'))\n",
    "    signature = None\n",
    "    for item in signature_list:\n",
    "        if len(item) < 300: \n",
    "            signature = item # 已经知道小于300的就一个\n",
    "    if not signature:\n",
    "        return None\n",
    "    elif 'Hawkeyes Wind' in signature or 'Zhiqiang' in signature: # 只能不断添加规则...\n",
    "        return None\n",
    "    elif '<' in signature:\n",
    "        soup = BeautifulSoup(item, 'html.parser')\n",
    "        signature = soup.get_text()\n",
    "        return signature\n",
    "    else:\n",
    "        return signature\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "name_list = create_name_list(data_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Chen Yijun,\r\n",
      "Twitter: @ethantw\r\n",
      "\r\n",
      "\n",
      "-ambrose <http://gniw.ca>\n",
      "\n",
      "\n",
      "\n",
      "Bobby Tung\r\n",
      "Mobileï¼š+886-975068558\n",
      "bobbytung@wanderer.tw\r\n",
      "Webï¼šhttp://wanderer.tw\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Bobby Tung\r\n",
      "Mobileï¼š+886-975068558\n",
      "bobbytung@wanderer.tw\r\n",
      "Webï¼šhttp://wanderer.tw\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "Zi Bin Cheah\n",
      "HTML5 Chinese IG chair\n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "signature_list = []\n",
    "for data in data_list:\n",
    "    signature_list.append(extract_signature(data, name_list))\n",
    "signature_list = filter(None, signature_list)\n",
    "signature_list = set(signature_list)\n",
    "for item in signature_list:\n",
    "    print item"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
